🧩 NVIDIA Nemotron 3 Nano - How To Run Guide
Run & fine-tune NVIDIA Nemotron 3 Nano locally on your device!
NVIDIA releases Nemotron 3 Nano, a 30B-parameter hybrid reasoning MoE model with ~3.6B active parameters, built for fast, accurate coding, math, and agentic tasks. It has a 1M-token context window and is the best in its size class on SWE-Bench, GPQA Diamond, reasoning, chat, and throughput.
Nemotron 3 Nano runs on 24GB of RAM/VRAM (or unified memory), and you can now fine-tune it locally. Thanks to NVIDIA for providing Unsloth with day-zero support.
NVIDIA Nemotron 3 Nano GGUF to run: unsloth/Nemotron-3-Nano-30B-A3B-GGUF. We also uploaded BF16 and FP8 variants.
⚙️ Usage Guide
NVIDIA recommends these settings for inference:
General chat/instruction (default):
temperature = 1.0, top_p = 1.0
Tool calling use-cases:
temperature = 0.6, top_p = 0.95
For most local use, set:
max_new_tokens = 32,096 to 262,144 for standard prompts, with a max of 1M tokens. Increase this for deep reasoning or long-form generation as your RAM/VRAM allows.
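If you are running the BF16 checkpoint with Transformers rather than a GGUF, a minimal sketch of applying these settings is below. The repo name unsloth/Nemotron-3-Nano-30B-A3B is an assumption (the GGUF repo name minus -GGUF), and this architecture may require a recent transformers release:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "unsloth/Nemotron-3-Nano-30B-A3B"  # assumed BF16 repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "What is 2+2?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# General chat defaults; swap to temperature=0.6, top_p=0.95 for tool calling
output = model.generate(
    input_ids,
    max_new_tokens=32_096,
    do_sample=True,
    temperature=1.0,
    top_p=1.0,
)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```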
The chat template format can be seen by running the snippet below:
from transformers import AutoTokenizer

# The BF16 repo name is an assumption, derived from the GGUF repo name above
tokenizer = AutoTokenizer.from_pretrained("unsloth/Nemotron-3-Nano-30B-A3B")

print(tokenizer.apply_chat_template([
    {"role" : "user", "content" : "What is 1+1?"},
    {"role" : "assistant", "content" : "2"},
    {"role" : "user", "content" : "What is 2+2?"}
], add_generation_prompt = True, tokenize = False))

Running this prints the Nemotron 3 chat template format.
🖥️ Run Nemotron-3-Nano-30B-A3B
Depending on your use-case, you will need different settings. Some GGUFs end up similar in size because the model architecture (like gpt-oss) has dimensions that are not divisible by 128, so some parts can't be quantized to lower bits.
Llama.cpp Tutorial (GGUF):
Instructions to run in llama.cpp (note we will be using 4-bit to fit most devices):
Obtain the latest llama.cpp on GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inference.
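A typical build sketch for Linux (the package names are for Debian/Ubuntu; adjust for your distro):

```bash
apt-get update
apt-get install build-essential cmake curl libcurl4-openssl-dev -y
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON -DLLAMA_CURL=ON
cmake --build llama.cpp/build --config Release -j --clean-first \
    --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/
```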
You can directly pull from Hugging Face. You can increase the context to 1M as your RAM/VRAM allows.
Follow this for general instruction use-cases:
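For example (a sketch: the flags are standard llama.cpp options, the sampling values are the recommended general-chat settings above, and UD-Q4_K_XL is one of the uploaded quants):

```bash
./llama.cpp/llama-cli \
    -hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:UD-Q4_K_XL \
    --jinja \
    --temp 1.0 \
    --top-p 1.0 \
    --ctx-size 262144
```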
Follow this for tool-calling use-cases:
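The same sketch with the recommended tool-calling sampling settings:

```bash
./llama.cpp/llama-cli \
    -hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:UD-Q4_K_XL \
    --jinja \
    --temp 0.6 \
    --top-p 0.95 \
    --ctx-size 262144
```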
Alternatively, download the model first (after installing pip install huggingface_hub hf_transfer). You can choose UD-Q4_K_XL or other quantized versions.
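A minimal download sketch, assuming the standard huggingface_hub API and the repo/quant names mentioned above:

```python
import os
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"  # enable faster downloads via hf_transfer

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id = "unsloth/Nemotron-3-Nano-30B-A3B-GGUF",
    local_dir = "unsloth/Nemotron-3-Nano-30B-A3B-GGUF",
    allow_patterns = ["*UD-Q4_K_XL*"],  # change the pattern to grab another quant
)
```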
Then run the model in conversation mode:
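A sketch, assuming the .gguf filename below matches what was downloaded (adjust it if not); recent llama-cli builds default to conversation mode when the model ships a chat template:

```bash
./llama.cpp/llama-cli \
    --model unsloth/Nemotron-3-Nano-30B-A3B-GGUF/Nemotron-3-Nano-30B-A3B-UD-Q4_K_XL.gguf \
    --jinja \
    --temp 1.0 \
    --top-p 1.0 \
    --ctx-size 262144 \
    --n-gpu-layers 99
```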
Also adjust the context window as required, and make sure your hardware can handle anything beyond a 256K context window. Setting it to 1M may trigger a CUDA OOM crash, which is why the default here is 262,144.
Nemotron 3 uses <think> (token ID 12) and </think> (token ID 13) for reasoning. Use --special in llama.cpp to print these tokens. You might also need --verbose-prompt to see <think>, since it is prepended to the prompt.
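For example (a sketch using standard llama-cli flags):

```bash
# Print special tokens such as <think> / </think> and the full prompt
./llama.cpp/llama-cli \
    -hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:UD-Q4_K_XL \
    --jinja --special --verbose-prompt \
    -p "What is 2+2?"
```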
Because the model was trained with NoPE, you only need to change max_position_embeddings. The model doesn’t use explicit positional embeddings, so YaRN isn’t needed.
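If you are extending the context of the Transformers (BF16) checkpoint, a minimal sketch is below; the config path is an assumption and should point at your local copy of the model:

```python
import json

cfg_path = "Nemotron-3-Nano-30B-A3B/config.json"  # assumed local path to the checkpoint
with open(cfg_path) as f:
    cfg = json.load(f)

# Raise the declared context length; no YaRN / rope_scaling entry is needed with NoPE
cfg["max_position_embeddings"] = 1_048_576  # 1M tokens
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```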
🦥 Fine-tuning Nemotron 3 Nano and RL
Unsloth now supports fine-tuning of all Nemotron models, including Nemotron 3 Nano. The 30B model does not fit on a free Colab GPU; however, we still made an 80GB A100 Colab notebook for you to fine-tune with. 16-bit LoRA fine-tuning of Nemotron 3 Nano will use around 60GB of VRAM.
On fine-tuning MoEs: it's probably not a good idea to fine-tune the router layer, so we disable it by default. If you want the model to retain its reasoning capabilities (optional), use a mix of direct answers and chain-of-thought examples, with at least 75% reasoning and 25% non-reasoning examples in your dataset.
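As a rough illustration, here is a minimal Unsloth LoRA setup sketch. The model repo name and the target module names are assumptions (this hybrid Mamba/attention/MoE architecture may use different layer names), so treat the Colab notebook as the reference:

```python
from unsloth import FastLanguageModel

# 16-bit LoRA (load_in_4bit = False) needs roughly 60GB of VRAM for this model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/Nemotron-3-Nano-30B-A3B",  # assumed repo name
    max_seq_length = 8192,
    load_in_4bit = False,
)

# Standard projection names shown here; router / expert-gating layers are not targeted
model = FastLanguageModel.get_peft_model(
    model,
    r = 32,
    lora_alpha = 32,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
)
```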
✨ Reinforcement Learning + NeMo Gym
We worked with the open-source NVIDIA NeMo Gym team to help democratize RL environments. Our collaboration enables single-turn rollout RL training across many domains of interest, including math, coding, and tool use, using training environments and datasets from NeMo Gym.
Also check out our latest collaboration guide published on NVIDIA's official Developer blog.
🎉 Llama-server serving & deployment
To deploy Nemotron 3 for production, we use llama-server. In a new terminal (for example, via tmux), deploy the model via:
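A deployment sketch using the same recommended sampling settings; the port (8001) is an arbitrary choice and UD-Q4_K_XL is one of the uploaded quants:

```bash
./llama.cpp/llama-server \
    -hf unsloth/Nemotron-3-Nano-30B-A3B-GGUF:UD-Q4_K_XL \
    --jinja \
    --temp 1.0 \
    --top-p 1.0 \
    --ctx-size 262144 \
    --host 0.0.0.0 \
    --port 8001
```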
When you run the above, llama-server will load the model and report that it is listening on the configured port (8001 in the sketch above).
Then in a new terminal, after doing pip install openai, do:
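A minimal client sketch using the standard OpenAI Python SDK pointed at the local server; the API key is a placeholder (llama-server does not require one unless configured), and the model name is informational for a single-model server:

```python
from openai import OpenAI

# Point the client at the local llama-server started above (port 8001)
client = OpenAI(base_url = "http://localhost:8001/v1", api_key = "sk-no-key-needed")

completion = client.chat.completions.create(
    model = "unsloth/Nemotron-3-Nano-30B-A3B-GGUF",
    messages = [{"role": "user", "content": "What is 2+2?"}],
    temperature = 1.0,
    top_p = 1.0,
)
print(completion.choices[0].message.content)
```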
This will print the model's response.
Benchmarks
Nemotron-3-Nano-30B-A3B is the best-performing model in its size class across these benchmarks, including throughput.
